The working directory

R has a powerful notion of the working directory. This is where R looks for files that you ask it to load, and where it will put any files that you ask it to save. RStudio shows your current working directory at the top of the console and you can print this out in R code by running getwd():
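```r
getwd()  # prints the absolute path of your current working directory
```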

The working directory is an important concept to understand: it is the place where R looks for and saves files. When you write code for your project, it should refer to files relative to the root of your working directory and should only need files within this structure.

It is good practice to keep a set of related data, analyses, and text self-contained in the working directory. All of the scripts within this folder can then use relative paths to files that indicate where inside the project a file is located (as opposed to absolute paths, which point to where a file is on a specific computer). Working this way makes it a lot easier to move your project around on your computer and share it with others without worrying about whether or not the underlying scripts will still work.

RStudio Projects

RStudio provides a helpful set of tools to do this through its “Projects” interface, which not only creates a working directory for you, but also remembers its location (allowing you to quickly navigate to it) and optionally preserves custom settings and open files to make it easier to resume work after a break. Go through the steps for creating an “R Project” for this tutorial below.

  1. Start RStudio.
  2. Under the File menu, click on New Project. Choose New Directory, then New Project.
  3. Enter a name for this new folder (or “directory”), and choose a convenient location for it. This will be your working directory for the rest of the day (e.g., ~/CalAcademy-R-intro-project).
  4. Click on Create Project.

Using RStudio projects makes this easy and ensures that your working directory is set properly. If you need to check it, you can use getwd(). If for some reason your working directory is not what it should be, you can change it in the RStudio interface by navigating in the file browser to where your working directory should be, clicking on the blue gear icon “More”, and selecting “Set As Working Directory”. Alternatively, you can use setwd("/path/to/working/directory") to reset your working directory. However, your scripts should not include this line, because it will fail on someone else’s computer.

Organizing your working directory

Using a consistent folder structure across your projects will help keep things organized, and will also make it easy to find/file things in the future. This can be especially helpful when you have multiple projects. In general, you may create directories (folders) for code, data, and docs.

  • data/ Use this folder to store your raw data and any intermediate datasets you may create for a particular analysis. For the sake of transparency and provenance, you should always keep a copy of your raw data accessible and do as much of your data cleanup and preprocessing programmatically (i.e., with scripts, rather than manually) as possible. Separating raw data from processed data is also a good idea. For example, you could have files data/raw/tree_survey.plot1.txt and ...plot2.txt kept separate from a data/processed/tree.survey.csv file generated by the scripts/01.preprocess.tree_survey.R script.
  • code/ This would be the location to keep your R scripts for different analyses or plotting, and potentially a separate folder for your functions.
  • docs/ This would be a place to keep outlines, drafts, and other text.
  • figures/ This would be a place to keep all graphic outputs generated.
  • outputs/ This would be a place to keep all non-graphic outputs generated (e.g. tables and other R objects).

You may want additional directories or subdirectories depending on your project needs, but these should form the backbone of your working directory.

For this workshop, we will need a data/ folder to store our raw data, a code/ folder to store our R scripts, an outputs/ folder to export data and other objects, and a figures/ folder for the figures that we will save.

  • A data folder with the data we will be analyzing already exists in the “R_workshops-Cal_Academy-Apr2019” root folder. You can copy this folder directly to your newly created working directory (e.g., ~/CalAcademy-R-intro) by selecting the data folder in the bottom right RStudio panel (by ticking the square box on the left of the folder name), then clicking on “More” under the Files tab, then “Copy to”, and navigating to your newly created working directory.

  • Next, to create additional folders in your directory, under the Files tab on the right of the screen, click on New Folder and create a folder named code within your newly created working directory (e.g., ~/CalAcademy-R-intro/code). (Alternatively, type dir.create("code") at your R console.) Repeat these operations to create the outputs/ and figures/ folders.

Importing data

We are now ready to have a look at some data! We are going to be exploring iNaturalist data collected for the 2019 City Nature Challenge in the cities of San Francisco and Los Angeles (nope… no particular reason - just two random cities!).

To import the data, we will use the read.csv function:
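A sketch of the call; the exact file name is an assumption (it follows the path used in the note further down), so adjust it to match the CSV in your data/ folder:

```r
# Read the CSV file into a data frame named CNC_2019_observations
CNC_2019_observations <- read.csv("data/CNC_2019_observations.csv")
```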

This statement doesn’t produce any output because, as you might recall, assignments don’t display anything. If we want to check that our data has been loaded, we can see the contents of the data frame by typing its name: CNC_2019_observations.

Wow… that was a lot of output. At least it means the data loaded properly. Let’s check the top (the first 6 lines) of this data frame using the function head():
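```r
head(CNC_2019_observations)  # shows the first 6 rows by default
```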

#>         id               observed_on_string observed_on
#> 1 23077130      2019-04-26 12:09:45 AM SAST  2019-04-26
#> 2 23077855      2019-04-26 12:07:17 AM SAST  2019-04-26
#> 3 23078215      2019-04-26 12:46:48 AM SAST  2019-04-26
#> 4 23078382 2019-04-26 12:04:55 AM GMT+02:00  2019-04-26
#> 5 23078661 2019-04-26 12:11:49 AM GMT+02:00  2019-04-26
#> 6 23078727 2019-04-26 12:09:25 AM GMT+02:00  2019-04-26
#>          time_observed_at           time_zone user_id    user_login
#> 1 2019-04-25 22:09:45 UTC Africa/Johannesburg 1667099      hhodgson
#> 2 2019-04-25 22:07:17 UTC Africa/Johannesburg 1667099      hhodgson
#> 3 2019-04-25 22:46:48 UTC Africa/Johannesburg  748780       lmossop
#> 4 2019-04-25 22:04:55 UTC Africa/Johannesburg 1667260 andrewhodgson
#> 5 2019-04-25 22:11:49 UTC Africa/Johannesburg 1667260 andrewhodgson
#> 6 2019-04-25 22:09:25 UTC Africa/Johannesburg 1667260 andrewhodgson
#>                created_at              updated_at quality_grade  license
#> 1 2019-04-25 22:27:57 UTC 2019-07-22 12:35:50 UTC      research CC-BY-NC
#> 2 2019-04-25 22:41:23 UTC 2020-03-28 12:12:30 UTC      research CC-BY-NC
#> 3 2019-04-25 22:49:02 UTC 2019-04-25 22:49:23 UTC      needs_id CC-BY-NC
#> 4 2019-04-25 22:52:11 UTC 2020-05-06 19:37:35 UTC      research CC-BY-NC
#> 5 2019-04-25 22:57:39 UTC 2020-05-06 19:37:36 UTC      research CC-BY-NC
#> 6 2019-04-25 22:58:56 UTC 2020-05-06 19:37:36 UTC      research CC-BY-NC
#>                                                 url sound_url tag_list
#> 1 https://www.inaturalist.org/observations/23077130                   
#> 2 https://www.inaturalist.org/observations/23077855                   
#> 3 https://www.inaturalist.org/observations/23078215                   
#> 4 https://www.inaturalist.org/observations/23078382                   
#> 5 https://www.inaturalist.org/observations/23078661                   
#> 6 https://www.inaturalist.org/observations/23078727                   
#>   description num_identification_agreements
#> 1                                         4
#> 2                                         3
#> 3                                         0
#> 4                                         4
#> 5                                         8
#> 6                                         4
#>   num_identification_disagreements captive_cultivated oauth_application_id
#> 1                                0              false                    2
#> 2                                0              false                    2
#> 3                                0              false                    2
#> 4                                0              false                    2
#> 5                                0              false                    2
#> 6                                0              false                    2
#>                                                 place_guess  latitude
#> 1 50 Yarmouth Rd, Muizenberg, Cape Town, 7950, South Africa -34.09706
#> 2 50 Yarmouth Rd, Muizenberg, Cape Town, 7950, South Africa -34.09692
#> 3                   Cape Peninsula, Cape Town, South Africa -34.20994
#> 4                              City of Cape Town, ZA-WC, ZA -34.09714
#> 5                              City of Cape Town, ZA-WC, ZA -34.09711
#> 6                              City of Cape Town, ZA-WC, ZA -34.09711
#>   longitude positional_accuracy geoprivacy taxon_geoprivacy
#> 1  18.47416                   5                            
#> 2  18.47429                   9                            
#> 3  18.40169                  25                            
#> 4  18.47420                   9                            
#> 5  18.47426                  66                            
#> 6  18.47423                  33                            
#>   coordinates_obscured positioning_method positioning_device
#> 1                false                gps                gps
#> 2                false                gps                gps
#> 3                false                                      
#> 4                false                gps                gps
#> 5                false                gps                gps
#> 6                false                gps                gps
#>   place_town_name place_county_name place_state_name place_country_name
#> 1            <NA>      Simon's Town             <NA>       South Africa
#> 2            <NA>      Simon's Town             <NA>       South Africa
#> 3            <NA>      Simon's Town             <NA>       South Africa
#> 4            <NA>      Simon's Town             <NA>       South Africa
#> 5            <NA>      Simon's Town             <NA>       South Africa
#> 6            <NA>      Simon's Town             <NA>       South Africa
#>   place_admin1_name place_admin2_name                   species_guess
#> 1      Western Cape City of Cape Town         Marbled Leaf-toed Gecko
#> 2      Western Cape City of Cape Town Hairy Golden Orb-weaving Spider
#> 3      Western Cape City of Cape Town           Butterflies and Moths
#> 4      Western Cape City of Cape Town         Marbled leaf-toed gecko
#> 5      Western Cape City of Cape Town         Marbled leaf-toed gecko
#> 6      Western Cape City of Cape Town         Marbled leaf-toed gecko
#>            scientific_name                     common_name
#> 1     Afrogecko porphyreus         Marbled Leaf-toed Gecko
#> 2 Trichonephila fenestrata Hairy Golden Orb-weaving Spider
#> 3              Lepidoptera           Butterflies and Moths
#> 4     Afrogecko porphyreus         Marbled Leaf-toed Gecko
#> 5     Afrogecko porphyreus         Marbled Leaf-toed Gecko
#> 6     Afrogecko porphyreus         Marbled Leaf-toed Gecko
#>   iconic_taxon_name taxon_id taxon_kingdom_name taxon_phylum_name
#> 1          Reptilia    93486           Animalia          Chordata
#> 2         Arachnida   904338           Animalia        Arthropoda
#> 3           Insecta    47157           Animalia        Arthropoda
#> 4          Reptilia    93486           Animalia          Chordata
#> 5          Reptilia    93486           Animalia          Chordata
#> 6          Reptilia    93486           Animalia          Chordata
#>   taxon_subphylum_name taxon_superclass_name taxon_class_name
#> 1           Vertebrata                               Reptilia
#> 2          Chelicerata                              Arachnida
#> 3             Hexapoda                                Insecta
#> 4           Vertebrata                               Reptilia
#> 5           Vertebrata                               Reptilia
#> 6           Vertebrata                               Reptilia
#>   taxon_subclass_name taxon_superorder_name taxon_order_name
#> 1                                                   Squamata
#> 2                                                    Araneae
#> 3           Pterygota                            Lepidoptera
#> 4                                                   Squamata
#> 5                                                   Squamata
#> 6                                                   Squamata
#>   taxon_suborder_name taxon_superfamily_name taxon_family_name
#> 1              Sauria                               Gekkonidae
#> 2       Araneomorphae             Araneoidea         Araneidae
#> 3                                                             
#> 4              Sauria                               Gekkonidae
#> 5              Sauria                               Gekkonidae
#> 6              Sauria                               Gekkonidae
#>   taxon_subfamily_name taxon_supertribe_name taxon_tribe_name
#> 1                                                            
#> 2           Nephilinae                                       
#> 3                                                            
#> 4                                                            
#> 5                                                            
#> 6                                                            
#>   taxon_subtribe_name taxon_genus_name taxon_genushybrid_name
#> 1                            Afrogecko                       
#> 2                        Trichonephila                       
#> 3                                                            
#> 4                            Afrogecko                       
#> 5                            Afrogecko                       
#> 6                            Afrogecko                       
#>         taxon_species_name taxon_hybrid_name taxon_subspecies_name
#> 1     Afrogecko porphyreus                                        
#> 2 Trichonephila fenestrata                                        
#> 3                                                                 
#> 4     Afrogecko porphyreus                                        
#> 5     Afrogecko porphyreus                                        
#> 6     Afrogecko porphyreus                                        
#>   taxon_variety_name taxon_form_name      city
#> 1                                    Cape Town
#> 2                                    Cape Town
#> 3                                    Cape Town
#> 4                                    Cape Town
#> 5                                    Cape Town
#> 6                                    Cape Town

Note

read.csv assumes that fields are delimited by commas. However, in several countries the comma is used as a decimal separator and the semicolon (;) as a field delimiter. If you want to read in this type of file, you can use the read.csv2 function. It behaves exactly like read.csv but uses different defaults for the decimal and field separators. If you are working with yet another format, both separators can be specified by the user; check out the help for read.csv() by typing ?read.csv to learn more. There is also read.delim() for tab-separated data files. It is important to note that all of these functions are wrappers around the main read.table() function, just with different default arguments. As such, the CNC_2019_observations data above could also have been loaded with read.table() by setting the separator argument to ,. The code is as follows: CNC_2019_observations <- read.table(file="data/CNC_2019_observations.csv", sep=",", header=TRUE). The header argument has to be set to TRUE in order to read the column headers, as read.table() sets header to FALSE by default.

What are data frames?

Data frames are the de facto data structure for most tabular data, and what we use for statistics and plotting.

A data frame can be created by hand, but most commonly they are generated by the functions read.csv() or read.table(); in other words, when importing spreadsheets from your hard drive (or the web).

A data frame is the representation of data in the format of a table where the columns are vectors that all have the same length. Because columns are vectors, each column must contain a single type of data (e.g., characters, integers, factors). For example, here is a figure depicting a data frame comprising a numeric, a character, and a logical vector.

We can see this when inspecting the structure of a data frame with the function str():
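For example, with a small hand-built data frame (the values here are made up for illustration):

```r
df <- data.frame(
  id      = 1:3,                          # numeric (integer) vector
  species = c("gecko", "spider", "moth"), # character vector
  native  = c(TRUE, FALSE, TRUE)          # logical vector
)
str(df)  # one line per column: its class, length, and first values
```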

Inspecting data.frame Objects

We already saw how the functions head() and str() can be useful to check the content and the structure of a data frame. Here is a non-exhaustive list of functions to get a sense of the content/structure of the data. Let’s try them out!

  • Size:
    • dim(CNC_2019_observations) - returns a vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)
    • nrow(CNC_2019_observations) - returns the number of rows
    • ncol(CNC_2019_observations) - returns the number of columns
  • Content:
    • head(CNC_2019_observations) - shows the first 6 rows
    • tail(CNC_2019_observations) - shows the last 6 rows
  • Names:
    • names(CNC_2019_observations) - returns the column names (synonym of colnames() for data.frame objects)
    • rownames(CNC_2019_observations) - returns the row names
  • Summary:
    • str(CNC_2019_observations) - structure of the object and information about the class, length and content of each column
    • summary(CNC_2019_observations) - summary statistics for each column

Note: most of these functions are “generic”, they can be used on other types of objects besides data.frame.

Indexing and subsetting data frames

Our CNC_2019_observations data frame has rows and columns (it has 2 dimensions). If we want to extract some specific data from it, we need to specify the “coordinates” we want: row numbers come first, followed by column numbers. However, note that different ways of specifying these coordinates lead to results with different classes.
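For example (the row and column positions here are arbitrary):

```r
CNC_2019_observations[1, 2]     # first row, second column (a single value)
CNC_2019_observations[1:3, 2]   # first three rows of the second column
CNC_2019_observations[1, ]      # the whole first row (still a data frame)
CNC_2019_observations[, 1]      # the whole first column (a vector)
```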

You can also exclude certain indices of a data frame using the “-” sign:
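```r
CNC_2019_observations[, -1]   # all columns except the first
CNC_2019_observations[-1, ]   # all rows except the first
```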

Data frames can be subset by calling indices (as shown previously), but also by calling their column names directly:
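```r
CNC_2019_observations["scientific_name"]     # result is a data frame
CNC_2019_observations[, "scientific_name"]   # result is a vector
CNC_2019_observations[["scientific_name"]]   # result is a vector
CNC_2019_observations$scientific_name        # result is a vector
```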

In RStudio, you can use the autocompletion feature to get the full and correct names of the columns.

Factors

When looking at str(CNC_2019_observations) we can see that several of the columns consist of integers. However, the columns observed_on, scientific_name, common_name, taxon_kingdom_name, taxon_phylum_name, … are of a special class called factor. Factors are very useful and actually contribute to making R particularly well suited to working with data. So we are going to spend a little time introducing them.

Factors represent categorical data. They are stored as integers associated with labels and they can be ordered or unordered. While factors look (and often behave) like character vectors, they are actually treated as integer vectors by R. So you need to be very careful when treating them as strings.

Once created, factors can only contain a pre-defined set of values, known as levels. By default, R always sorts levels in alphabetical order. For instance, if you have a factor with 3 levels:
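```r
# The three quality_grade values that appear in our data
quality_grade <- factor(c("research", "needs_id", "research", "casual", "needs_id"))
```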

R will assign 1 to the level "casual", 2 to the level "needs_id" and 3 to the level "research" (because c comes before n which comes before r, even though the first element in this vector is "research"). You can see this by using the function levels() and you can find the number of levels using nlevels():
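For example (re-creating the factor so the snippet stands alone):

```r
quality_grade <- factor(c("research", "needs_id", "research", "casual", "needs_id"))
levels(quality_grade)   # "casual" "needs_id" "research" (alphabetical)
nlevels(quality_grade)  # 3
```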

Sometimes, the order of the factors does not matter, other times you might want to specify the order because it is meaningful (e.g., “low”, “medium”, “high”), it improves your visualization, or it is required by a particular type of analysis. Here, one way to reorder our levels in the quality_grade vector would be:
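A sketch reconstructed to match the output below:

```r
quality_grade <- factor(c("research", "needs_id", "research", "casual", "needs_id"))
quality_grade  # default: levels in alphabetical order
quality_grade <- factor(quality_grade, levels = c("research", "needs_id", "casual"))
quality_grade  # levels now follow the order we specified
```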

#> [1] research needs_id research casual   needs_id
#> Levels: casual needs_id research
#> [1] research needs_id research casual   needs_id
#> Levels: research needs_id casual

In R’s memory, these factors are represented by integers (1, 2, 3), but are more informative than integers because factors are self describing: "research", "needs_id" and "casual" are more descriptive than 1, 2 and 3. Which observations are “research”? You wouldn’t be able to tell just from the integer data. Factors, on the other hand, have this information built in. It is particularly helpful when there are many levels (such as in the "scientific_name" field).

Data Manipulation using dplyr and tidyr

Bracket subsetting is handy, but it can be cumbersome and difficult to read, especially for complicated operations. Enter dplyr. dplyr is a package for making tabular data manipulation easier. It pairs nicely with tidyr which enables you to swiftly convert between different data formats for plotting and analysis.

Packages in R are basically sets of additional functions that let you do more stuff. The functions we’ve been using so far, like str() or data.frame(), come built into R; packages give you access to more of them. Before you use a package for the first time you need to install it on your machine, and then you should import it in every subsequent R session when you need it. You should already have installed the tidyverse package. This is an “umbrella-package” that installs several packages useful for data analysis which work together well such as tidyr, dplyr, ggplot2, tibble, etc.

The tidyverse package tries to address 3 common issues that arise when doing data analysis with some of the functions that come with R:

  1. The results from a base R function sometimes depend on the type of data.
  2. Using R expressions in a non standard way, which can be confusing for new learners.
  3. Hidden arguments, having default operations that new learners are not aware of.

We have seen in our previous lesson that when building or importing a data frame, the columns that contain characters (i.e., text) are coerced (=converted) into the factor data type. We had to set stringsAsFactors to FALSE to prevent this hidden argument from converting our data type.

This time we will use the tidyverse package to read the data and avoid having to set stringsAsFactors to FALSE.

To load the package type:
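```r
library(tidyverse)
```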

What are dplyr and tidyr?

The package dplyr provides easy tools for the most common data manipulation tasks. It is built to work directly with data frames, with many common tasks optimized by being written in a compiled language (C++). An additional feature is the ability to work directly with data stored in an external database. The benefits of doing this are that the data can be managed natively in a relational database, queries can be conducted on that database, and only the results of the query are returned.

This addresses a common problem with R in that all operations are conducted in-memory and thus the amount of data you can work with is limited by available memory. The database connections essentially remove that limitation in that you can connect to a database of many hundreds of GB, conduct queries on it directly, and pull back into R only what you need for analysis.

The package tidyr addresses the common problem of wanting to reshape your data for plotting and use by different R functions. Sometimes we want data sets where we have one row per measurement. Sometimes we want a data frame where each measurement type has its own column, and rows are instead more aggregated groups - like plots or aquaria. Moving back and forth between these formats is nontrivial, and tidyr gives you tools for this and more sophisticated data manipulation.

To learn more about dplyr and tidyr after the workshop, you may want to check out this handy data transformation with dplyr cheatsheet and this one about tidyr.

We’ll read in our data using the read_csv() function, from the tidyverse package readr, instead of read.csv().
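The call, with the file name taken from the parsing output shown below:

```r
CNC_2019_observations <- read_csv("data/CNC-2019-observations-top3.csv")
```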

#> Parsed with column specification:
#> cols(
#>   .default = col_character(),
#>   id = col_double(),
#>   observed_on = col_date(format = ""),
#>   user_id = col_double(),
#>   sound_url = col_logical(),
#>   num_identification_agreements = col_double(),
#>   num_identification_disagreements = col_double(),
#>   captive_cultivated = col_logical(),
#>   oauth_application_id = col_double(),
#>   latitude = col_double(),
#>   longitude = col_double(),
#>   positional_accuracy = col_double(),
#>   coordinates_obscured = col_logical(),
#>   place_town_name = col_logical(),
#>   place_state_name = col_logical(),
#>   taxon_id = col_double(),
#>   taxon_supertribe_name = col_logical(),
#>   taxon_genushybrid_name = col_logical(),
#>   taxon_form_name = col_logical()
#> )
#> See spec(...) for full column specifications.
#> Warning: 128680 parsing failures.
#>  row             col           expected                                                     actual                                  file
#> 1038 taxon_form_name 1/0/T/F/TRUE/FALSE Sambucus nigra laciniata                                   'data/CNC-2019-observations-top3.csv'
#> 4310 sound_url       1/0/T/F/TRUE/FALSE https://static.inaturalist.org/sounds/35826.mp3?1556297829 'data/CNC-2019-observations-top3.csv'
#> 4443 sound_url       1/0/T/F/TRUE/FALSE https://static.inaturalist.org/sounds/35825.mp3?1556297726 'data/CNC-2019-observations-top3.csv'
#> 4780 sound_url       1/0/T/F/TRUE/FALSE https://static.inaturalist.org/sounds/35843.mp3?1556301132 'data/CNC-2019-observations-top3.csv'
#> 4804 sound_url       1/0/T/F/TRUE/FALSE https://static.inaturalist.org/sounds/35847.mp3?1556301537 'data/CNC-2019-observations-top3.csv'
#> .... ............... .................. .......................................................... .....................................
#> See problems(...) for more details.

Notice that the class of the data is now tbl_df.

This is referred to as a “tibble”. Tibbles tweak some of the behaviors of the data frame objects we introduced earlier. The data structure is very similar to a data frame. For our purposes the only differences are that:

  1. In addition to displaying the data type of each column under its name, it only prints the first few rows of data and only as many columns as fit on one screen.
  2. Columns of class character are never converted into factors.

We’re going to learn some of the most common dplyr functions:

  • select(): subset columns
  • filter(): subset rows on conditions
  • mutate(): create new columns by using information from other columns
  • group_by() and summarize(): create summary statistics on grouped data
  • arrange(): sort results
  • count(): count discrete values

Selecting columns and filtering rows

To select columns of a data frame, use select(). The first argument to this function is the data frame (CNC_2019_observations), and the subsequent arguments are the columns to keep.
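For example:

```r
select(CNC_2019_observations, scientific_name, observed_on, city)
```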

To select all columns except certain ones, put a “-” in front of the variable to exclude it.
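For example:

```r
select(CNC_2019_observations, -positional_accuracy, -user_login)
```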

This will select all the variables in CNC_2019_observations except positional_accuracy and user_login.

To choose rows based on specific criteria, use filter():
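```r
filter(CNC_2019_observations, city == "San Francisco")
```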

Pipes

What if you want to select and filter at the same time? There are three ways to do this: use intermediate steps, nested functions, or pipes.

With intermediate steps, you create a temporary data frame and use that as input to the next function, like this:
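```r
# (object names here are just examples)
CNC_sf <- filter(CNC_2019_observations, city == "San Francisco")
CNC_sf_small <- select(CNC_sf, scientific_name, observed_on)
```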

This is readable, but can clutter up your workspace with lots of objects that you have to name individually. With multiple steps, that can be hard to keep track of.

You can also nest functions (i.e. one function inside of another), like this:
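```r
CNC_sf_small <- select(
  filter(CNC_2019_observations, city == "San Francisco"),
  scientific_name, observed_on
)
```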

This is handy, but can be difficult to read if too many functions are nested, as R evaluates the expression from the inside out (in this case, filtering, then selecting).

The last option, pipes, are a recent addition to R. Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset. Pipes in R look like %>% and are made available via the magrittr package, installed automatically with dplyr. If you use RStudio, you can type the pipe with Ctrl + Shift + M if you have a PC or Cmd + Shift + M if you have a Mac.
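For example, combining the filter and select from above into one pipeline:

```r
CNC_2019_observations %>%
  filter(city == "San Francisco") %>%
  select(scientific_name, observed_on)
```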

In the above code, we use the pipe to send the CNC_2019_observations dataset first through filter() to keep observations made in San Francisco, then through select() to keep only the scientific_name and observed_on columns. Since %>% takes the object on its left and passes it as the first argument to the function on its right, we don’t need to explicitly include the data frame as an argument to the filter() and select() functions any more.

Some may find it helpful to read the pipe like the word “then”. For instance, in the above example, we took the data frame CNC_2019_observations, then we filtered for rows with city == "San Francisco", then we selected columns scientific_name and observed_on. The dplyr functions by themselves are somewhat simple, but by combining them into linear workflows with the pipe, we can accomplish more complex manipulations of data frames.

If we want to create a new object with this smaller version of the data, we can assign it a new name:
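```r
# (the name CNC_sf_small is just an example)
CNC_sf_small <- CNC_2019_observations %>%
  filter(city == "San Francisco") %>%
  select(scientific_name, observed_on)
```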

Note that the final data frame is the leftmost part of this expression.

Challenge

Using pipes, subset the CNC_2019_observations data to include all animal observations with a positional accuracy less than 50 and retain only the columns scientific_name, taxon_kingdom_name, and city.

Mutate

Frequently you’ll want to create new columns based on the values in existing columns, for example to do unit conversions, or to find the ratio of values in two columns. For this we’ll use mutate().

To create a new column based on the values of an existing one:
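For example, a sketch converting positional_accuracy (assumed to be recorded in meters) into kilometers:

```r
CNC_2019_observations %>%
  mutate(positional_accuracy_km = positional_accuracy / 1000)
```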

You can also create a second new column based on the first new column within the same call of mutate():
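For example, building a hypothetical label column from a new kilometers column within the same call:

```r
CNC_2019_observations %>%
  mutate(positional_accuracy_km = positional_accuracy / 1000,
         accuracy_label = paste0(positional_accuracy_km, " km"))
```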

The summarize() function

group_by() is often used together with summarize(), which collapses each group into a single-row summary of that group. group_by() takes as arguments the column names that contain the categorical variables for which you want to calculate the summary statistics. Let’s say we want to find out the mean latitude and longitude of observations by city:
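```r
CNC_2019_observations %>%
  group_by(city) %>%
  summarize(mean_latitude  = mean(latitude, na.rm = TRUE),
            mean_longitude = mean(longitude, na.rm = TRUE))
```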

You may also have noticed that the output from these calls doesn’t run off the screen anymore. It’s one of the advantages of tbl_df over data frame.

You can also group by multiple columns:
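```r
CNC_2019_observations %>%
  group_by(city, taxon_kingdom_name) %>%
  summarize(n_observations = n())
```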

Counting

When working with data, we often want to know the number of observations found for each factor or combination of factors. For this task, dplyr provides count(). For example, if we wanted to count the number of observations for each kingdom, we would do:
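```r
CNC_2019_observations %>%
  count(taxon_kingdom_name)
```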

The count() function is shorthand for something we’ve already seen: grouping by a variable, and summarizing it by counting the number of observations in that group. In other words, CNC_2019_observations %>% count(taxon_kingdom_name) is equivalent to:
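```r
CNC_2019_observations %>%
  group_by(taxon_kingdom_name) %>%
  summarize(n = n())
```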

The previous example shows the use of count() to count the number of rows/observations for one factor (i.e., taxon_kingdom_name). If we wanted to count combinations of factors, such as city and taxon_kingdom_name, we would specify the first and the second factor as the arguments of count():
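```r
CNC_2019_observations %>%
  count(city, taxon_kingdom_name)
```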

With the above code, we can proceed with arrange() to sort the table according to a number of criteria so that we have a better comparison. For instance, we might want to arrange the table above in (i) alphabetical order of the kingdom names and (ii) descending order of the count:
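```r
CNC_2019_observations %>%
  count(city, taxon_kingdom_name) %>%
  arrange(taxon_kingdom_name, desc(n))
```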

In the table above, we can see that there are a number of observations that have not been assigned a kingdom name (i.e. NA). These are probably observations from the casual and needs_id quality_grade categories. To focus on research_grade observations, we can start our pipeline with a filter() statement. In addition, we can reorder the columns in the output in a way that makes more sense for us with select():
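```r
CNC_2019_observations %>%
  filter(quality_grade == "research") %>%
  count(city, taxon_kingdom_name) %>%
  arrange(taxon_kingdom_name, desc(n)) %>%
  select(taxon_kingdom_name, city, n)
```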

And so, we can clearly see that, in the CNC 2019, Cape Town made the most observations for all kingdoms, except for fungi and protozoa. Well done Cape Town!

Challenge

  1. How many species were observed in SF versus other cities in total? (hint: see ?n_distinct)
  2. What was the most observed species in each city? (hint: combine count() and arrange())

Joins

Understanding joins

It’s rare that a data analysis involves only a single table of data. Typically you have many tables of data, and you must combine them to answer the questions that you’re interested in. Collectively, multiple tables of data are called relational data because it is the relations, not just the individual datasets, that are important.

Combining two tables is most often done using joins. To help you learn how joins work, let’s use a visual representation:

The coloured column represents the “key” variable: these are used to match the rows between the tables. The grey column represents the “value” column that is carried along for the ride. In these examples I’ll show a single key variable, but the idea generalises in a straightforward way to multiple keys and multiple values.

A join is a way of connecting each row in x to zero, one, or more rows in y. The following diagram shows each potential match as an intersection of a pair of lines.

(If you look closely, you might notice that we’ve switched the order of the key and value columns in x. This is to emphasise that joins match based on the key; the value is just carried along for the ride.)

In an actual join, matches will be indicated with dots. The number of dots = the number of matches = the number of rows in the output.

Left joins

There are multiple types of joins, but the most commonly used join is the left join: you use this whenever you look up additional data from another table, because it preserves the original observations even when there isn’t a match. The left join should be your default join: use it unless you have a strong reason to prefer one of the others. Let’s have a look at how left joins work by loading some additional data on iNaturalist images for City Nature Challenge 2019 observations.
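A sketch of the import; the file name CNC_2019_images.csv is an assumption, so adjust it to match the CSV in your data/ folder:

```r
CNC_2019_images <- read_csv("data/CNC_2019_images.csv")
```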

#> Parsed with column specification:
#> cols(
#>   id = col_double(),
#>   image_url = col_character()
#> )

Now let’s join the image data in CNC_2019_images with the other observation data in CNC_2019_observations using the field “id” as our key.
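A sketch with left_join():

```r
CNC_2019_data <- CNC_2019_observations %>%
  left_join(CNC_2019_images, by = "id")
```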

Note how CNC_2019_data has the same number of rows as CNC_2019_observations but one additional column, image_url - the columns of CNC_2019_images minus the shared key field “id”.

You can find a comprehensive guide to joins here.

Exporting data

Now that you have learned how to use dplyr to extract information from or summarize your raw data, you may want to export these new data sets to share them with your collaborators or for archiving.

Similar to the read_csv() function used for reading CSV files into R, there is a write_csv() function that generates CSV files from data frames.

To store data output, we have already created a folder named outputs in our working directory. We don’t want to write generated datasets in the same directory as our raw data. It’s good practice to keep them separate. The data folder should only contain the raw, unaltered data, and should be left alone to make sure we don’t delete or modify it. In contrast, our script will generate the contents of the outputs directory, so even if the files it contains are deleted, we can always re-generate them.

So, let’s save CNC_2019_data as a CSV file in our outputs folder.
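For example (the output file name is just a suggestion):

```r
write_csv(CNC_2019_data, "outputs/CNC_2019_data.csv")
```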